Significant Feature Clustering
Author
Abstract
In this thesis, we present a new clustering algorithm, which we call Significance Feature Clustering (SFC), designed to cluster text documents. Its central premise is the mapping of raw frequency count vectors to discrete-valued significance vectors whose entries are -1, 0, or 1. These values indicate whether a word is significantly negative, neutral, or significantly positive, respectively. First, standard tf-idf vectors are computed from the raw frequency vectors; these tf-idf vectors are then transformed to significance vectors using a parameter α, which controls whether each vector entry is mapped to -1, 0, or 1. SFC clusters agglomeratively: each document's significance vector initially represents a cluster of size one containing just that document, and SFC iteratively merges the two clusters whose averages are most similar under cosine similarity. We show that, with a good α value, the significance vectors produced by SFC provide an accurate indication of which words are significant to which documents, as well as the type of significance, and correspondingly yield a good clustering in terms of a well-known definition of clustering quality. We further demonstrate that a user need not select α manually, because we develop a new definition of clustering quality that is highly correlated with text clustering quality. Our metric extends the family of metrics known as internal similarity so that it can be applied to a tree of clusters rather than a set, and it also factors in an aspect of recall that was absent from previous internal similarity metrics. Using this new definition of internal similarity, which we call maximum tree internal similarity, we show that a close-to-optimal text clustering may be picked from any number of clusterings created with different values of α. The automatically selected clusterings have quality close to that of a well-known and powerful hierarchical clustering algorithm.
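The pipeline described in the abstract (tf-idf, then significance vectors, then agglomerative merging by cosine similarity of cluster averages) can be illustrated with a minimal Python sketch. The abstract does not state how α maps a tf-idf value to -1, 0, or 1, so the rule used here (an entry is significant when it lies more than α standard deviations from that term's mean across documents) is purely an assumption, as are the natural-log idf variant and the function names `tfidf`, `significance`, `cosine`, and `sfc_merges`. It is not the thesis's implementation.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return (vocab, rows): one tf-idf row per tokenised document."""
    vocab = sorted({w for d in docs for w in d})
    n = len(docs)
    df = Counter(w for d in docs for w in set(d))  # document frequency
    rows = []
    for d in docs:
        tf = Counter(d)
        rows.append([tf[w] * math.log(n / df[w]) for w in vocab])
    return vocab, rows


def significance(rows, alpha):
    """Map each tf-idf entry to -1, 0, or 1.

    ASSUMPTION (not from the source): an entry is significant when it is
    more than `alpha` standard deviations away from the term's mean.
    """
    n, width = len(rows), len(rows[0])
    out = [[0] * width for _ in rows]
    for j in range(width):
        col = [r[j] for r in rows]
        mean = sum(col) / n
        std = math.sqrt(sum((x - mean) ** 2 for x in col) / n)
        for i, x in enumerate(col):
            if x > mean + alpha * std:
                out[i][j] = 1
            elif x < mean - alpha * std:
                out[i][j] = -1
    return out


def cosine(u, v):
    """Cosine similarity; defined as 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def sfc_merges(sig):
    """Repeatedly merge the two clusters whose average vectors are most similar.

    Returns the merge history as (members_a, members_b, similarity) tuples.
    """
    width = len(sig[0])

    def avg(members):
        return [sum(sig[m][j] for m in members) / len(members) for j in range(width)]

    clusters = {i: [i] for i in range(len(sig))}  # start with singletons
    merges = []
    while len(clusters) > 1:
        keys = sorted(clusters)
        s, a, b = max(
            ((cosine(avg(clusters[a]), avg(clusters[b])), a, b)
             for i, a in enumerate(keys) for b in keys[i + 1:]),
            key=lambda t: t[0],
        )
        merges.append((tuple(clusters[a]), tuple(clusters[b]), s))
        clusters[a] = clusters[a] + clusters.pop(b)
    return merges


# Toy corpus (hypothetical data): two animal documents, two finance documents.
docs = [
    "cat cat dog pet".split(),
    "cat dog dog pet".split(),
    "stock market stock pet".split(),
    "market stock bond pet".split(),
]
vocab, rows = tfidf(docs)
sig = significance(rows, alpha=0.5)
merges = sfc_merges(sig)
for left, right, sim in merges:
    print(left, right, round(sim, 3))
```

On this toy corpus the two animal documents merge first and the two finance documents second. The merge history is a tree of clusters, which is the kind of structure the abstract says maximum tree internal similarity is applied to; that metric is not defined in the abstract and is not sketched here.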
Similar resources
Optimal Feature Selection for Data Classification and Clustering: Techniques and Guidelines
In this paper, principles and existing feature selection methods for classifying and clustering data are introduced. To that end, categorizing frameworks for finding selected subsets, namely, search-based and non-search-based procedures, as well as evaluation criteria and data mining tasks, are discussed. In the following, a platform is developed as an intermediate step toward developing an intell...
Steel Consumption Forecasting Using Nonlinear Pattern Recognition Model Based on Self-Organizing Maps
Steel consumption is a critical factor affecting pricing decisions and a key element to achieve sustainable industrial development. Forecasting future trends of steel consumption based on analysis of nonlinear patterns using artificial intelligence (AI) techniques is the main purpose of this paper. Because there are several features affecting target variable which make the analysis of relations...
Classification of Parkinson's Disease Based on Intra-Regional and Inter-Regional Indices of the Brain Motor Network Using Resting-State fMRI Data
Parkinson’s disease (PD) is a progressive neurological disorder characterized by tremor, rigidity, and slowness of movement. Recent studies on investigation of the brain function show that there are spontaneous fluctuations between regions at rest as resting state network affected in various disorders. In this paper, we used amplitude of low frequency fluctuation (ALFF) for the study of intra-r...
MLIFT: Enhancing Multi-label Classifier with Ensemble Feature Selection
Multi-label classification has gained significant attention during recent years, due to the increasing number of modern applications associated with multi-label data. Despite its short life, different approaches have been presented to solve the task of multi-label classification. LIFT is a multi-label classifier which utilizes a new strategy to multi-label learning by leveraging label-specific ...
Supervised Feature Extraction of Face Images for Improvement of Recognition Accuracy
Dimensionality reduction methods transform or select a low-dimensional feature space to efficiently represent the original high-dimensional feature space of the data. Feature reduction techniques are an important step in many pattern recognition problems in different fields, especially in the analysis of high-dimensional data. Hyperspectral images are acquired by remote sensors and human face images ar...